Abstract - Ensemble methods provide a principled framework for
building high performance classifiers and representing many types of
data. As a result, these methods can be useful for making inferences
in many domains such as classification and multi-modal biometrics. We
introduce a novel ensemble method for combining multiple
representations (or views). The method is a multiple view
generalization of AdaBoost. Similar to AdaBoost, base classifiers are
independently built from each representation. Unlike AdaBoost,
however, all data types share the same sampling distribution as the
view whose weighted training error is the smallest among all the
views. As a result, the most consistent data type dominates over
time, thereby significantly reducing sensitivity to noise. In
addition, our proposal is provably better than AdaBoost trained on any
single type of data. The proposed method is applied to the problems of
facial and gender prediction based on biometric traits as well as of
protein classification. Experimental results show that our method
outperforms several competing techniques including kernel-based data
fusion.

Keywords:  AdaBoost,  data  fusion,  stacking,  semi-definite 
programming. 

1  Introduction 

Classifiers employed in real world scenarios must deal with various
adversities such as noise in sensors, intra-class variations,
restricted degrees of freedom and, in some cases, spoof attacks. It is
often helpful to develop classifiers that rely on data from various
sources for classification. Such algorithms, known in the literature
as multimodal classification algorithms, require a clever way of
fusing the various sources of information. A robust data fusion
strategy compensates for any errors in the feature extraction process
due to the adversities faced by a classifier.

The research described here and the conclusions are those of the
authors. The document does not in any way represent the policies of
the MIT Lincoln Laboratory or the US Air Force Research Laboratory.
We are given a set of training examples
$X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and $M$ disjoint
features for each example, $x_i = \{x_i^1, x_i^2, \ldots, x_i^M\}$,
where $x_i^j \in \mathbb{R}^{q_j}$ and
$y_i \in \mathcal{Y} = \{-1, +1\}$. Each member $x_i^j$ in the set
$x_i$ is known as a view of example $x_i$. In this case, it is the
$j$th view of example $x_i$. For instance, when three sensors such as
radar (Radr), infrared (IR) and visible (Vis) are used to capture an
event, each example $x_i$ can be thought of as a set of three views,
each consisting of three features obtained from the intensities of
radar, infrared and visible components. In this case, the number of
views will be three and we can represent the three views of the
example $x_i$ as $\{x_i^{Radr}, x_i^{IR}, x_i^{Vis}\}$. We assume that
examples $(x_i, y_i)$ are drawn randomly and independently according
to a fixed but unknown probability distribution $D$ over
$\mathcal{X} \times \mathcal{Y}$. Here the
input  space  X  is  where  q  =  (fa¬ 

In this paper, we present a novel method for fusing multiple
representations of data with boosting. Our method is a multiple view
generalization of AdaBoost [1]. Similar to AdaBoost, base classifiers
are independently built from each view. Furthermore, all views share
the same sampling distribution as the view whose weighted training
error is minimum among all the views. This allows the most consistent
data type (view) to dominate over time, thereby significantly reducing
sensitivity to noise. In addition, since the final ensemble contains
classifiers that are trained to focus on different views of the data,
better generalization performance can be expected. We support this
argument with empirical evaluation performed on FERET facial images
[2] and CYGD genomic data [3].

2  Related  Work 

Considerable research in the pattern recognition field is focused on
fusion rules that aggregate the outputs of the first level experts and
make a final decision. Various techniques for fusion of expert
observations, such as linear weighted voting, Naive Bayes classifiers,
the kernel function approach, potential functions, decision trees or
multilayer perceptrons, have been proposed in recent years [4, 5].
Other approaches are based on bagging, boosting, and arcing
classifiers [1, 6, 7]. Comprehensive surveys of various classifier
fusion studies and approaches can be found in [8, 9, 10].

In  [11]  Wolpert  proposes  stacked  generalization  that  is 
defined  as  any  scheme  that  feeds  data  from  one  set  of  classi¬ 
fiers  to  another  before  making  a  final  decision.  The  data  that 
feeds  up  the  net  of  the  classifiers  is  provided  by  multiple  par¬ 
titionings  into  two  subsets  of  the  original  learning  set.  These 
pairs  of  subsets  are  further  employed  to  gather  information 
about  the  bias  of  the  original  classifier(s)  with  respect  to  the 
learning  set.  The  bias  of  the  constituent  classifiers  with  re¬ 
spect  to  the  learning  set  is  estimated  and  corrected  by  the 
stacked  generalization.  In  information  fusion,  it  is  equiv¬ 
alent  to  forming  a  linear  combination  of  the  classification 
results  of  the  constituent  classifiers. 

Lanckriet et al. [12] introduce a kernel-based data fusion approach to
protein function prediction in yeast. The method combines multiple
kernel representations in an optimal fashion by formulating the
problem as a convex optimization problem that can be solved using
semidefinite programming. That is, given a set of kernel matrices
$\mathcal{K} = \{K_1, K_2, \ldots, K_m\}$, the optimal combination
$K = \sum_{i=1}^{m} \mu_i K_i$ can be obtained by optimizing the
coefficients $\mu_i$ through semidefinite programming.

There  is  a  close  relationship  between  our  technique  and 
that  of  Viola  and  Jones  [13].  If  we  have  a  single  view  and 
base  classifiers  are  allowed  to  include  features  as  well,  then 
both  techniques  reduce  to  standard  AdaBoost.  When  noise 
exists,  however,  the  two  techniques  diverge.  In  the  case  of 
Viola  and  Jones,  it  behaves  exactly  like  AdaBoost.  Noise 
forces  the  boosting  algorithm  to  focus  on  noisy  examples, 
thereby  distorting  the  optimal  decision  boundary.  On  the 
other  hand,  our  approach  restricts  noise  to  individual  views, 
which  has  a  similar  effect  to  that  of  placing  less  mass  of  sam¬ 
pling  probability  on  these  noisy  examples.  This  is  the  key 
difference  between  the  two  techniques.  By  restricting  noisy, 
thus  “difficult,”  examples  to  individual  views,  the  mass  of 
sampling  probability  on  these  examples  will  be  restricted  in 
our  technique.  This  is  possible  because  probability  mass 
will  be  determined  by  those  views  having  less  noise. 

3  Boosting  Using  Shared  Sampling 
Distribution 

AdaBoost has been shown to improve the prediction accuracy of weak
classifiers using an iterative weight update process [1]. The
technique combines weak classifiers (classifiers having classification
accuracy slightly better than chance) in a weighted vote, giving an
overall strong classifier. A detailed explanation of the AdaBoost
algorithm is omitted here for brevity; interested readers may refer to
[14, 15, 16] for more on AdaBoost.

One of the ways boosting may be used for classifier fusion would be to
run boosting separately on each view, obtain separate ensembles for
each view, and then take a majority vote among the ensembles when
presented with a test example. In this case, separate training of
classifiers is needed for each view and the sampling distributions of
the data points are also independent.

We  propose  a  different  yet  simple  approach.  Our  ap¬ 
proach  performs  separate  training  for  each  view.  However, 
the  weight  distribution  of  training  examples  is  shared  among 
all  the  views  at  each  boosting  round.  The  main  steps  of  the 
proposed  algorithm  are  shown  in  Algorithm  1. 

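Algorithm 1: Boosting With Shared Sampling Distribution (BSSD)

1. Input: $z_0^j = \{(x_i^j, y_i)\}_{i=1}^{n}$, $j = 1, \ldots, M$.

2. Initialization: $W_1 = \{w_1(i) = \frac{1}{n}\}_{i=1}^{n}$.

3. For $k = 1$ to $k_{max}$

(a) Sample $z_k^j$ from $z_0^j$ using the distribution $W_k$.

(b) Compute hypothesis $h_k^j$ from $z_k^j$ for each view $j$.

(c) Calculate error $\epsilon_k^j$:
$\epsilon_k^j = P_{i \sim W_k}[h_k^j(x_i^j) \neq y_i]$.

(d) If for each view: $\{\epsilon_k^j\}_{j=1}^{M} < 0.5$, select
$h_k^*$ corresponding to $\epsilon_k^* = \min_j \{\epsilon_k^j\}$.

(e) Calculate
$\alpha_k^* = \frac{1}{2} \ln\left(\frac{1 - \epsilon_k^*}{\epsilon_k^*}\right)$.

(f) Update
$w_{k+1}(i) = \frac{w_k(i)}{Z_k^*} \times e^{-h_k^*(x_i^*) y_i \alpha_k^*}$,
where $Z_k^*$ is a normalizing factor.

4. Output: $F(x) = \sum_{k=1}^{k_{max}} \alpha_k^* h_k^*(x^*)$.

5. Final hypothesis: $H(x) = \mathrm{sign}(F(x))$.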

The input to the algorithm consists of the M views of n training
examples. The algorithm produces as output a classifier that fuses
data from all the views. In the initialization step, all the views of
a given training example are initialized with the same weight. Notice
that our algorithm works in multimodal fusion where data types might
not be compatible or might not be fixed size vectors. We only require
that the number of training examples from each modality be the same.
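The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the base learner is a weighted decision stump (the paper uses other base classifiers), the weights are passed directly to the learner instead of resampling as in step (a), the loop simply stops if some view fails the 0.5 test of step (d), and the names `fit_stump`, `stump_predict`, `bssd_fit` and `bssd_predict` are hypothetical.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump; returns (feature, threshold, polarity, error)."""
    best = (0, 0.0, 1, np.inf)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()  # weighted training error
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best

def stump_predict(stump, X):
    f, thr, pol, _ = stump
    return np.where(pol * (X[:, f] - thr) >= 0, 1, -1)

def bssd_fit(views, y, k_max=20):
    """views: list of M arrays of shape (n, q_j); y: labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # step 2: one shared distribution
    ensemble = []                        # entries: (view index, stump, alpha)
    for _ in range(k_max):
        stumps = [fit_stump(X, y, w) for X in views]   # step (b)
        errs = [s[3] for s in stumps]                  # step (c)
        if max(errs) >= 0.5:             # step (d) fails: stop (assumption)
            break
        j = int(np.argmin(errs))         # winning view
        eps = max(errs[j], 1e-10)        # guard against division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)        # step (e)
        pred = stump_predict(stumps[j], views[j])
        w = w * np.exp(-alpha * y * pred)              # step (f)
        w /= w.sum()                     # normalization (Z_k)
        ensemble.append((j, stumps[j], alpha))
    return ensemble

def bssd_predict(ensemble, views):
    """Sign of the weighted vote F(x) over the winning hypotheses."""
    F = sum(a * stump_predict(s, views[j]) for j, s, a in ensemble)
    return np.sign(F)
```

Note that only the winning view's hypothesis updates the weights at each round, yet every view is trained against the resulting distribution at the next round; this is the shared sampling distribution that distinguishes the scheme from running AdaBoost per view.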

3.1  Simple  Illustration 

We use a simple two class example in four dimensions to help explain
how BSSD works and highlight the difference between AdaBoost and BSSD.
This example is taken from the Iris data [17] with two classes
(Versicolour and Virginica). We randomly select 10 examples from each
class. Thus, there are 20 examples in total, where examples 1 to 10
are in class Versicolour and examples 11 to 20 are in class Virginica.
For BSSD, there are two views: sepal and petal. Each view has two
attributes. For the sepal view, we have sepal length and sepal width,
while for the petal view, we have petal length and petal width. Figure
1 shows the two views.

To make the problem more interesting, we randomly added 30% noise to
each view independently by "flipping" the label from one class to
another. For the first view, the noisy examples are: 3, 4, 9, 12, 14
and 19. For the second view, they are: 5, 8, 9, 12, 16 and 18. For
illustration purposes, linear classifiers with weighted least squares
fit are used as base classifiers.

Figure 1: Two views of the Iris data.
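The label-flipping noise described above can be sketched as follows. This is an illustrative sketch only: the helper name `flip_labels` and the random seed are assumptions, so the flipped indices it produces will not match the specific noisy examples listed in the text.

```python
import numpy as np

def flip_labels(y, fraction, rng):
    """Copy y (values in {-1, +1}) and flip a random fraction of its labels.

    Returns the noisy labels and the sorted indices that were flipped.
    """
    y_noisy = y.copy()
    n_flip = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = -y_noisy[idx]
    return y_noisy, np.sort(idx)

rng = np.random.default_rng(0)            # seed chosen arbitrarily
y = np.array([-1] * 10 + [1] * 10)        # 10 examples per class
# Noise is drawn independently for each view, as in the illustration.
y_sepal, noisy_sepal = flip_labels(y, 0.30, rng)
y_petal, noisy_petal = flip_labels(y, 0.30, rng)
```

With 20 examples, a 30% noise level flips six labels per view, which matches the six noisy examples reported for each view.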


Figure  2:  Sampling  weights.  Left  column:  Winning  views 
along  with  decision  boundaries.  Middle  column:  Sam¬ 
pling  weights  computed  by  BSSD.  Right  column:  Sampling 
weights  computed  by  AdaBoost. 

We  performed  50  boosting  rounds.  The  left  column 
in  Figure  2  shows  the  winning  views  along  with  decision 
boundaries  computed  by  BSSD,  while  the  middle  column 
shows  the  shared  sampling  weights  for  the  first  five  itera¬ 
tions.  After  the  first  boosting  round,  the  first  (sepal)  view  is 
the  winning  view.  The  base  classifier  mislabels  example  14. 
Thus,  its  weight  increases,  while  the  weights  of  the  rest  de¬ 
crease.  At  the  next  boosting  round,  the  second  (petal)  view 
is  the  winner  whose  base  classifier  mislabels  examples  5,  8, 
12,  16,  and  18,  but  correctly  labels  example  14.  As  a  re¬ 
sult,  the  sampling  weights  for  examples  5,  8,  12,  16  and  18 
are  increased,  while  the  weight  for  example  14  is  decreased. 
Similar  observations  can  be  made  for  the  remaining  boost¬ 
ing  rounds. 


What  is  more  interesting  to  observe  is  that  BSSD  does 
not  overemphasize  the  noisy  examples,  as  evidenced  by  the 
shared  sampling  weights  associated  with  these  examples. 
That  is,  BSSD  places  less  mass  of  sampling  probability  on 
these  difficult  examples,  with  the  exception  of  example  14. 
The  reason  is  that  when  a  view  competes  and  wins,  it  con¬ 
tributes  its  piece  of  information  about  an  event  to  the  boost¬ 
ing  process  by  forcing  losing  views  to  accept  its  interpreta¬ 
tion  of  the  event  through  shared  sampling  probability.  And 
as  such,  the  winning  view  potentially  makes  “corrections” 
to  sampling  probability  resulting  from  overcommitment  by 
the  losing  views. 

For example, when the Petal (second) view misclassifies example 5 (one
of its noisy examples), it increases its weight. However, when the
Sepal (first) view competes and wins, it reduces the sampling weight
for example 5, because example 5 is not "difficult" as far as Sepal is
concerned. This can be seen from boosting round 3 to round 4. As long
as the views do not share the same noisy examples (this assumption is
reasonable in practice, because information for each view is obtained
from independent sources), this mechanism of alternating winning views
plays a role in "softening" the re-sampling weights for difficult or
noisy examples, thereby making BSSD more robust against noise.

We  can  state  this  formally  in  the  following  lemma. 


Lemma 1 Let $j$ represent the view among $M$ views that has the least
amount of noise (e.g., it has the fewest noisy examples) or the best
representation in terms of class separability. When the view $j$ wins
in the learning process of BSSD, the sampling weights for noisy
examples other than those associated with the $j$th view will be
decreased.

Proof. Let the margin of the training example $z_i = (x_i, y_i)$ be
$d_i = y_i \sum_{t=1}^{k} c_t h_t^*(x_i^*)$, where
$c_t = \alpha_t / \sum_{t=1}^{k} \alpha_t$. Then $w_{k+1}(z_i)$ can be
written as [18]

$$ w_{k+1}(z_i) = \frac{\exp(-|\alpha| d_i)}{\sum_{l=1}^{n} \exp(-|\alpha| d_l)} \qquad (1) $$

where $|\alpha| = \sum_{t=1}^{k} |\alpha_t|$. As $|\alpha|$ increases,
examples with smaller margin (e.g., difficult or noisy examples) will
receive larger sampling weights. Since $|\alpha|$ increases at least
linearly with the number of boosting rounds [18], the noisy examples
will receive larger sampling weights. When the $j$th view wins, it
will increase the sampling weights for its noisy or difficult
examples, while decreasing the sampling weights for the smaller margin
examples that are not part of the $j$th view. The $j$th view winning
will happen during the learning process because the $j$th view has the
fewest noisy examples, thus an overall large margin or smaller average
error, where the rate of error for $z_i$ is given by [18],
$err(z_i) = \sum_{t=1}^{k} c_t I(y_i \neq h_t^*(x_i^*)) = \frac{1}{2}(1 - d_i)$.
Here $I(\cdot)$ denotes the indicator function. ■

While the lemma appears to set stronger conditions, BSSD performs much
better in practice than the conditions required by the lemma. For
example, the above Iris data experiment shows that even when the views
have the same percentage of noise, BSSD will outperform AdaBoost, as
long as the views do not share the exact noisy examples.

When the resulting boosted classifier is applied to the original data
with noise removed, 80% accuracy is obtained. For comparison, we ran
AdaBoost for 50 iterations on the same data, where the noisy examples
are: 5, 8, 9, 12, 16 and 18 (i.e., they are identical to those in the
second view for BSSD). AdaBoost achieved an accuracy of 70%. The right
column in Figure 2 shows the AdaBoost re-sampling weights for the
first five iterations. As expected, AdaBoost consistently placed more
mass of sampling probability on the noisy examples, resulting in less
accurate performance.

4  Error  Bounds 

4.1  Tighter  Bound  on  Training  Error 

Freund and Schapire [14] define the margin of the training example
$(x_i, y_i)$ as

$$ \theta_i = \frac{y_i F(x_i)}{\sum_{k=1}^{k_{max}} \alpha_k^*}. $$

Lemma 2 Given the weighted training errors at iteration $k$ for
hypotheses $h_k^j$ corresponding to $M$ views,
$\epsilon_k^1, \ldots, \epsilon_k^M$, denote
$\epsilon_k^* = \min_j \{\epsilon_k^j\}$ and let $\theta_i$ be the
margins of the training examples $(x_i, y_i)$. Then for an ensemble of
classifiers that fuses $M$ distinct views, the bound on the training
error is given by

$$ \epsilon(H) \le \prod_{k=1}^{k_{max}} \left[ 2\sqrt{\epsilon_k^*(1 - \epsilon_k^*)} \right]. $$

Notice that $\epsilon_k(1 - \epsilon_k)$ decreases as $\epsilon_k$
decreases for $\epsilon_k \in (0, 0.5]$. Therefore, the lemma states
that if at each boosting round we always choose, for the final
combined classifier, the base classifier having the smallest error
rate among all views, we can reduce training error faster. This
potentially produces a final combined classifier that is less complex
due to fewer base classifiers. This can lead to better generalization,
especially in noise free data.

4.2 Generalization Error Bound

The generalization error is defined as the probability of
misclassifying a new example [14]. Since the final classifier computed
by our algorithm is

$$ H(x) = \mathrm{sign}(F(x)) = \mathrm{sign}\left( \sum_{k=1}^{k_{max}} \alpha_k^* h_k^*(x^*) \right), $$

the final classifier's output will not be affected by the division of
$F(x)$ by a positive quantity, namely
$\sum_{k=1}^{k_{max}} \alpha_k^*$. Let us define
$f(x) = F(x) / \sum_{k=1}^{k_{max}} \alpha_k^*$.

The generalization error bound for our algorithm is a generalization
of the AdaBoost error bound for multiple views. The proof of our bound
follows the lines of that introduced by Schapire et al. [15]. Given
$\mathcal{H}$, the space from which the base classifiers are chosen,
the function $f(x)$, as defined above, clearly belongs to the convex
hull $C$ of $\mathcal{H}$. $C$ is the set of mappings that can be
generated by taking a weighted average of hypotheses from
$\mathcal{H}$:
$C = \{ f : x \to \sum_{h \in \mathcal{H}} a_h h(x) \mid a_h \ge 0; \sum_h a_h = 1 \}$.

Throughout the rest of the paper the notation $P_{(x,y) \sim W}[A]$
will mean the probability of the event $A$ when the example $(x, y)$
is sampled according to $W$, and $P_{(x,y) \sim S}[A]$ will mean the
probability with respect to sampling uniformly at random an example
from the training set. Their abbreviations will be $P_W[A]$ and
$P_S[A]$. The expected values will be denoted $E_W[A]$ and $E_S[A]$.

Theorem 3 ([15]) Let $S$ be a sample of $n$ examples chosen
independently at random according to $W$. Assume that the base
hypothesis space $\mathcal{H}$ has VC-dimension $d$ and let
$\delta > 0$. Then with probability at least $1 - \delta$ over the
random choice of the training set $S$, every weighted average function
$f \in C$ satisfies the following bound for all $\theta > 0$:

$$ P_W[y f(x) \le 0] \le P_S[y f(x) \le \theta] + O\left( \frac{1}{\sqrt{n}} \left( \frac{d \log^2(n/d)}{\theta^2} + \log\frac{1}{\delta} \right)^{1/2} \right). $$

The empirical error bound for the shared sampling distribution based
algorithm for fusion of base classifiers from $M$ views is provided by
Theorem 4.

Theorem 4 Given the weighted training errors at iteration $k$ for
hypotheses corresponding to the $M$ views,
$\epsilon_k^1, \ldots, \epsilon_k^M$, denote
$\epsilon_k^* = \min_j \{\epsilon_k^j\}$. Then for any $\theta$, we
have that

$$ P_S[y f(x) \le \theta] \le \prod_{k=1}^{k_{max}} \left[ 2\sqrt{(\epsilon_k^*)^{1-\theta} (1 - \epsilon_k^*)^{1+\theta}} \right]. $$

More recent results [19] give the following error bound

$$ P_W[y f(x) \le 0] \le \inf_{\theta \in (0,1]} \left\{ P_S[y f(x) \le \theta] + \frac{C}{\theta}\sqrt{\frac{d}{n}} + \left( \frac{\log\log_2(2\theta^{-1})}{n} \right)^{1/2} \right\} + \frac{\sqrt{\frac{1}{2}\log\frac{2}{\delta}} + 2}{\sqrt{n}}, $$

where $C$ is a constant. This bound slightly improves the bound
established in Theorem 3, and will be used later in our convergence
analysis. Notice that $\epsilon_k^{1-\theta}(1 - \epsilon_k)^{1+\theta}$
decreases with decreasing $\epsilon_k$, when
$\epsilon_k \in (0, 0.5]$ for any given $\theta > 0$. Thus for a fixed
sample size $n$, BSSD achieves better generalization performance.
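The effect of always selecting the minimum-error view on the training-error bound of Lemma 2 can be checked numerically. The sketch below is illustrative only: the per-round errors are made-up numbers, and in a real boosting run the error sequences depend on the evolving sampling weights, so this compares the bound expressions rather than simulating the algorithm. The helper name `training_error_bound` is hypothetical.

```python
import numpy as np

def training_error_bound(errors):
    """Product over rounds of 2*sqrt(eps*(1-eps)), as in Lemma 2."""
    errors = np.asarray(errors, dtype=float)
    return float(np.prod(2.0 * np.sqrt(errors * (1.0 - errors))))

# Hypothetical weighted errors for M = 3 views over 5 boosting rounds.
eps = np.array([
    [0.30, 0.35, 0.40, 0.38, 0.33],   # view 1
    [0.42, 0.28, 0.31, 0.45, 0.36],   # view 2
    [0.37, 0.39, 0.29, 0.30, 0.41],   # view 3
])

# Bound when a single view is boosted on its own error sequence.
per_view = [training_error_bound(row) for row in eps]
# Bound when the minimum-error view is selected at every round.
shared = training_error_bound(eps.min(axis=0))
```

Because each factor $2\sqrt{\epsilon(1-\epsilon)}$ grows with $\epsilon$ on $(0, 0.5]$, the value of `shared` can never exceed any entry of `per_view` for error sequences of this kind.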

5  Experimental  Evaluation 

We  have  carried  out  experimental  studies  evaluating  the 
performance  of  the  proposed  data  fusion  algorithm  on  a 
number  of  data  sets.  The  following  competing  methods  are 
compared. 

•  The  BSSD  algorithm  with  the  Naive  Bayes  classifiers 
[20]  as  base  classifiers  for  boosting.  The  Gaussian 
model  is  used  for  the  marginal  distributions  in  the 
Naive  Bayes  classifier. 

•  The  boosting  with  independent  sampling  distribution 
(BISD)  with  Naive  Bayes  classifiers.  This  algorithm  is 
a  variant  of  BSSD,  where  re-sampling  weights  of  train¬ 
ing  examples  are  independent  for  each  view. 


•  The  stacking  (Stacking)  algorithm  [11]  with  SVMs  as 
a  back-end  generalizer. 

•  The semidefinite programming (SDP) algorithm [12], where kernel
functions along each view can be Gaussian:
$k(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$, polynomial:
$k(x, y) = (x \cdot y + 1)^2$, or linear: $k(x, y) = x \cdot y$. Here
$\cdot$ denotes the dot product. We wish to thank Lanckriet for
providing us with Matlab code for SDP.

•  The  majority  vote  (MV)  algorithm,  where  SVMs  are 
used  as  component  classifier  along  each  view. 

Ten-fold cross-validation was used for model selection (or choosing
procedural parameters). For SVMs, the two procedural parameters,
$\sigma$ and the soft margin parameter $C$, each take values in
$[10^{-2}, 10^{2}]$. All results reported here are averaged over 20
runs, where each run splits data into 60% training and 40% testing.
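The evaluation protocol (20 runs, each with a random 60%/40% split) can be sketched as below. This is a generic sketch under assumptions: the nearest-centroid classifier is only a stand-in so that the example is self-contained, it is not one of the methods compared in the paper, and the function names are hypothetical.

```python
import numpy as np

def repeated_holdout(X, y, fit, predict, n_runs=20, train_frac=0.6, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    accs = []
    for _ in range(n_runs):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        model = fit(X[tr], y[tr])
        accs.append(np.mean(predict(model, X[te]) == y[te]))
    return float(np.mean(accs)), float(np.std(accs))

# Stand-in classifier (assumption): assign each point to the nearest class mean.
def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in (-1, 1)}

def predict_centroids(model, X):
    d_neg = np.linalg.norm(X - model[-1], axis=1)
    d_pos = np.linalg.norm(X - model[1], axis=1)
    return np.where(d_pos < d_neg, 1, -1)
```

Any of the compared methods could be plugged in through the `fit` and `predict` arguments, provided they follow the same calling convention.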


Figure  3:  Sample  images  from  FERET  facial  images 
database. 

5.1  Experimental  Data 

Two  real  examples  are  used  to  evaluate  the  proposed  tech¬ 
nique  and  its  competitors. 

5.1.1  FERET  Facial  Image  Data 

Three  binary  class  data  sets  have  been  generated  from  the 
FERET  database  of  facial  images  [2].  The  three  classifica¬ 
tion  problems  are  (1)  Face  classification,  (2)  Gender  classi¬ 
fication,  and  (3)  detection  of  Glasses  (spectacles)  on  faces. 
Sample  images  from  the  FERET  database  are  shown  in  Fig¬ 
ure  3. 

For  the  face  and  gender  classification  experiments,  each 
image  is  represented  by  three  views  in  terms  of  eigenfaces 
extracted  from  three  head  orientations  (poses):  (1)  frontal, 
(2)  half  left  profile  and  (3)  half  right  profile.  The  non-face 
images  in  the  face  classification  data  set  are  blacked  out 
faces.  In  the  glass  detection  experiment,  each  image  is  rep¬ 
resented  by  three  types  of  features  extracted  from  only  one 
pose  of  an  individual,  namely  (1)  eigenfaces,  (2)  Canny  filter 
detected  edges  [21],  and  (3)  wavelet  coefficients  [22].  Each 
dataset  has  101  pictures  and  the  number  of  dimensions  after 
applying  principal  component  analysis  is  101  for  each  view. 


5.1.2  CYGD  Genomic  Data 

The third data set is the yeast genomic data that can be obtained from
the MIPS Comprehensive Yeast Genome Database (CYGD) [3]. The task
consists of combining different data sources for gene classification
(membrane vs non-membrane proteins). There are three sources of data
(considered "views" in our framework) that are derived from BLAST and
Smith-Waterman genomic methods, and from gene expression measurements.
The dataset has 100 examples and the number of dimensions after
applying principal component analysis to each of the three views is
76, 74 and 64, respectively. These dimensions explain 90% of the
variance in the data.


Table 1: Results for face classification

Method   A_V1    A_V2    A_V3    A_fus   Sig.
MV       0.709   0.709   0.700   0.700   yes
Stack    0.709   0.709   0.700   0.717   yes
SDP      Gauss   Gauss   Gauss   0.698   yes
SDP      Poly    Gauss   Gauss   0.698   yes
SDP      Poly    Linear  Gauss   0.698   yes
BISD     0.62    0.60    0.58    0.750   no
BSSD     0.623   0.615   0.564   0.763

Results for gender classification

Method   A_V1    A_V2    A_V3    A_fus   Sig.
MV       0.555   0.490   0.549   0.539   yes
Stack    0.555   0.490   0.549   0.578   yes
SDP      Gauss   Gauss   Gauss   0.444   yes
SDP      Poly    Gauss   Gauss   0.450   yes
SDP      Poly    Linear  Gauss   0.442   yes
BISD     0.64    0.53    0.56    0.859   no
BSSD     0.633   0.538   0.578   0.865

Results for glass detection

Method   A_V1    A_V2    A_V3    A_fus   Sig.
MV       0.560   0.613   0.528   0.576   yes
Stack    0.560   0.613   0.528   0.672   yes
SDP      Gauss   Gauss   Gauss   0.457   yes
SDP      Poly    Gauss   Gauss   0.439   yes
SDP      Poly    Linear  Gauss   0.477   yes
BISD     0.57    0.56    0.56    0.720   no
BSSD     0.591   0.594   0.546   0.740

5.2  Experimental  Results 

Tables 1 and 2 show the average results registered by the competing
methods on the tasks. The average accuracy of individual classifiers
from each view before fusion is shown in the columns $A_{V1}$,
$A_{V2}$ and $A_{V3}$ for the facial data (Table 1) and in the columns
$A_B$, $A_{SW}$ and $A_G$ for the genomic data (Table 2). Also, the
paired t-test with a 95% confidence level was performed to determine
if the difference in performance between BSSD and the competing
techniques (Stacking, MV, SDP and BISD) is statistically significant,
and is shown in the last column.


These  results  demonstrate  that  our  proposed  fusion  al¬ 
gorithm  significantly  outperforms  the  competing  techniques 
(except  BISD,  where  no  significant  difference  is  observed) 
on  these  noise  free  problems  that  we  have  experimented 
with.  We  note  however  that  BSSD  does  outperform  BISD 
on  the  Yeast  genomic  dataset.  In  particular,  our  simple  tech¬ 
nique  registered  superior  performance  over  mathematically 
sophisticated  techniques  such  as  SDP.  We  will  explore  later 
mathematical  arguments  behind  this. 


Table 2: Results for Genomic Data

Methods  A_B     A_SW    A_G     A_fus   Sig.
Stack    0.62    0.58    0.59    0.63    no
MV       0.62    0.58    0.59    0.638   no
SDP      Gauss   Gauss   Gauss   0.60    yes
BISD     0.50    0.54    0.52    0.60    yes
BSSD     0.52    0.55    0.54    0.65

Notice  that  in  some  examples,  the  average  combined  ac¬ 
curacy  is  worse  than  that  obtained  from  a  single  view  (i.e., 
majority  vote  and  stacking).  In  the  case  of  the  majority  vote, 
if  the  majority  makes  a  wrong  decision,  so  does  the  com¬ 
bined  decision,  resulting  in  a  decrease  in  accuracy.  Simi¬ 
lar  observations  were  made  in  [23,  15].  For  stacking,  the 
average  fusion  accuracy  is  obtained  by  the  (stacked)  classi¬ 
fier  whose  input  is  class  labels  generated  by  the  component 
classifiers  along  each  view.  If  the  component  classifiers  pro¬ 
duce  poor  results,  the  input  to  the  stacked  classifier  will  be 
noisy  or  poorly  represented.  This  fact  helps  explain  why  the 
stacked  accuracy  is  worse  than  the  classification  accuracy  of 
some  of  the  component  classifiers. 

5.3  Robustness  against  Noise 

Robustness against noise is a key feature for any data fusion
algorithm that must operate across the full range of conditions and
scenarios the system is anticipated to encounter. In Tables 3, 4 and 5
we compare the robustness of BSSD and the competing techniques against
noise (average fusion accuracies over 20 runs). We randomly added
noise to the class labels of the training data sets on two or all
three views at various levels (10%, 20% and 30%) by "flipping" the
label from one class to another. Flipping labels to generate noise
produces a similar effect to that produced by poor representations
(features).

In the case of noisy views, noise is encoded in the base kernel
matrices in SDP, which in turn severely degrades its performance. On
the other hand, our approach simply relies more on other available
data sources if one of them is not reliable. In fact, our technique
prefers views that are better represented in terms of class
separability. Recall that at each iteration, the resampling and weight
update in BSSD are performed using a shared distribution. That is, the
weights for all views of a given training example are updated
according to the opinion of the winning classifier (having the
smallest average training error). This winning classifier is unlikely
to come from a view that either is poorly represented or provides
little information about the two classes. As a result, our algorithm
relies heavily on data sources that best separate the two classes,
which is important for building the optimal linear combination of base
classifiers.

Table 3: Robustness of BSSD vs MV

Noise on 2 views

Data Set  Noise   MV A_fus   BSSD A_fus   Stat. Signif.
Face      10%     0.70       0.75         yes
Face      20%     0.71       0.75         yes
Face      30%     0.70       0.74         yes
Gender    10%     0.57       0.75         yes
Gender    20%     0.58       0.69         yes
Gender    30%     0.58       0.66         yes
Glass     10%     0.59       0.73         yes
Glass     20%     0.58       0.70         yes
Glass     30%     0.58       0.67         yes

Noise on 3 views

Data Set  Noise   MV A_fus   BSSD A_fus   Stat. Signif.
Face      10%     0.70       0.74         yes
Face      20%     0.72       0.73         no
Face      30%     0.69       0.70         no
Gender    10%     0.55       0.71         yes
Gender    20%     0.57       0.61         yes
Gender    30%     0.57       0.54         no
Glass     10%     0.60       0.73         yes
Glass     20%     0.60       0.67         yes
Glass     30%     0.59       0.61         no

We note that, similar to [12], cross-validation was not used to choose
kernel parameters for SDP in our experiments. The idea is that SDP
learns the optimal coefficients $\mu_i$ of the kernel matrices, which
determine how much each view contributes to the final decision and
encourage diversity. It is highly likely that cross-validated kernel
parameters will help increase accuracy in classification. We suspect
that this may be one of the causes for the sub-optimal performance of
SDP on 2 out of 3 facial tasks. We will have more to say later
regarding SDP in the "Discussions" section.

Table 4:  Robustness of BSSD vs Stacking

2 Views

Data Set   Noise   Stacking Afus   BSSD Afus   Stat. Signif.
Face       10%     0.70            0.75        yes
Face       20%     0.70            0.75        yes
Face       30%     0.68            0.74        yes
Gender     10%     0.52            0.75        yes
Gender     20%     0.53            0.69        yes
Gender     30%     0.53            0.66        yes
Glass      10%     0.60            0.73        yes
Glass      20%     0.60            0.70        yes
Glass      30%     0.57            0.67        yes

3 Views

Data Set   Noise   Stacking Afus   BSSD Afus   Stat. Signif.
Face       10%     0.71            0.74        yes
Face       20%     0.70            0.73        no
Face       30%     0.69            0.70        no
Gender     10%     0.51            0.71        yes
Gender     20%     0.53            0.61        yes
Gender     30%     0.52            0.54        no
Glass      10%     0.59            0.73        yes
Glass      20%     0.60            0.67        yes
Glass      30%     0.57            0.61        no

Table 5:  Robustness of BSSD vs SDP

2 Views

Data Set   Noise   SDP Afus   BSSD Afus   Stat. Signif.
Face       10%     0.70       0.75        yes
Face       20%     0.70       0.75        yes
Face       30%     0.70       0.74        yes
Gender     10%     0.46       0.75        yes
Gender     20%     0.49       0.69        yes
Gender     30%     0.46       0.66        yes
Glass      10%     0.56       0.73        yes
Glass      20%     0.52       0.70        yes
Glass      30%     0.52       0.67        yes

3 Views

Data Set   Noise   SDP Afus   BSSD Afus   Stat. Signif.
Face       10%     0.70       0.74        yes
Face       20%     0.70       0.73        yes
Face       30%     0.70       0.70        no
Gender     10%     0.46       0.71        yes
Gender     20%     0.49       0.61        yes
Gender     30%     0.46       0.54        yes
Glass      10%     0.56       0.73        yes
Glass      20%     0.52       0.67        yes
Glass      30%     0.52       0.61        yes

5.4  Boosting  Majority  Vote  and  Feature  Concatenation

Most of the experiments described previously concern the majority vote algorithm with an SVM along each view as the component classifier (expert). An alternative is to use AdaBoost as the expert along each representation. It may well be the case that the ensemble mechanism can boost the performance of majority vote. In this experiment, we use the facial, texture and genomic data to examine majority vote with AdaBoost as the expert (AdaBoost-MV). As a reference, we also run AdaBoost on a single feature space formed by concatenating the features from each view (AdaBoost-Concatenated). All three methods, AdaBoost-MV, BSSD, and AdaBoost-Concatenated, use Naive Bayes [20] as base classifiers for boosting.
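
The two reference schemes differ only in where the views are fused: majority vote combines the decisions of per-view experts, while concatenation merges the features before any classifier is trained. A minimal sketch of both fusion steps follows, assuming labels in {-1, +1}; the function names and the tie-breaking rule are hypothetical, and the experts themselves (AdaBoost over Naive Bayes in the text) are not implemented here.

```python
def majority_vote(view_predictions):
    """Combine per-view labels in {-1, +1} by unweighted majority vote.

    view_predictions[v][i] is the label that the expert for view v
    assigns to example i. Ties are broken in favour of +1, an
    arbitrary choice made for this sketch.
    """
    n_examples = len(view_predictions[0])
    fused = []
    for i in range(n_examples):
        total = sum(preds[i] for preds in view_predictions)
        fused.append(1 if total >= 0 else -1)
    return fused


def concatenate_views(views):
    """Join each example's feature vectors across views into one vector.

    views[v][i] is the feature vector of example i in view v.
    """
    return [sum((list(x) for x in example), []) for example in zip(*views)]


# Decision-level fusion: three experts, three examples.
votes = majority_vote([[1, -1, 1], [1, 1, -1], [-1, -1, -1]])

# Feature-level fusion: a 2-dimensional view and a 1-dimensional view.
joined = concatenate_views([[[0.1, 0.2], [0.3, 0.4]], [[5.0], [6.0]]])
```

Concatenation yields one vector whose length is the sum of the view dimensions, which is why that scheme is the one exposed to the curse of dimensionality.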

While AdaBoost-Concatenated performs the worst, due mainly to the curse of dimensionality, BSSD stays on top, especially in noisy environments. The performance of BSSD and AdaBoost-MV makes clear that the shared sampling mechanism employed by BSSD once again offers a distinct advantage in dealing with noisy data.
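
The shared sampling idea can be sketched for a single boosting round as follows: compute each view's weighted training error under the common distribution, select the view with the smallest error, and reweight all examples according to that view's mistakes. The exponential update below is the standard AdaBoost rule and is an assumption about the details of BSSD; the names `shared_sampling_round`, `weights`, and `view_predictions` are hypothetical.

```python
import math


def shared_sampling_round(weights, labels, view_predictions):
    """One round of a shared-distribution boosting sketch.

    weights: current sampling distribution over examples (sums to 1).
    labels: true labels in {-1, +1}.
    view_predictions[v][i]: label predicted for example i by the base
        classifier trained on view v in this round.
    Returns (index of winning view, its coefficient alpha, new weights).
    """
    # Weighted training error of each view under the shared distribution.
    errors = [
        sum(w for w, y, p in zip(weights, labels, preds) if p != y)
        for preds in view_predictions
    ]
    best = min(range(len(errors)), key=errors.__getitem__)
    # Clamp the error away from 0 and 1 so the logarithm stays finite.
    eps = min(max(errors[best], 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Standard AdaBoost reweighting, driven by the winning view only.
    new_weights = [
        w * math.exp(-alpha * y * p)
        for w, y, p in zip(weights, labels, view_predictions[best])
    ]
    total = sum(new_weights)
    return best, alpha, [w / total for w in new_weights]


# Four examples, two views; view 1 makes a single mistake (on the last example).
best, alpha, updated = shared_sampling_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    labels=[1, 1, -1, -1],
    view_predictions=[[1, -1, -1, 1], [1, 1, -1, 1]],
)
```

Because every view samples from the same updated distribution in the next round, a noisy view with a large weighted error does not get to steer the reweighting.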


6  Discussions 

Our experiments show that a simple method like BSSD consistently outperforms mathematically sophisticated ones such as SDP. In the case of noisy views, noise is encoded in the base kernel matrices, which in turn severely degrades the performance of SDP. Robustness against noise, however, is essential for any data fusion system because of the very nature of the scenarios in which such systems are employed. For instance, in a multimodal biometric system it may not always be possible to obtain a perfect fingerprint of an individual, because of dust accumulated on the sensor or a scar on the finger.

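In the noise-free case, we explore the convergence rate to explain the apparent performance difference. The SDP generalization error bound, which holds with probability at least $1 - \delta$, is given by [12]

$$\frac{1}{n} \sum_{i=1}^{n} \max\{1 - y_i f(x_i), 0\} + \frac{1}{\sqrt{n}} \left( 4 + \sqrt{2 \log(1/\delta)} + \sqrt{\frac{C(\mathcal{K})}{n \gamma^2}} \right) \qquad (2)$$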


where $C(\mathcal{K}) = E_\sigma \max_{K \in \mathcal{K}} \sigma^T K \sigma$ with $\sigma \in \{\pm 1\}^{2n}$ chosen uniformly at random, and $\gamma$ is the margin parameter. Note that the first term is the empirical error, while the second term represents the complexity. Asymptotically (i.e., as $n$ goes to infinity), we are interested in the complexity terms in both (1) and (2). From (1), we have $\tilde{O}(\sqrt{d / (n \theta^2)})$. For SDP, we have $\tilde{O}(\sqrt{C(\mathcal{K}) / (n^2 \gamma^2)})$. Here $\tilde{O}$ is used to hide all logarithmic and constant factors. In [12], it is shown that $C(\mathcal{K}) \le 2n^2$. The following lemma shows that it is exactly equal to $2n^2$.
Given a fixed set $\{K_1, \ldots, K_m\}$ of kernel matrices, define

$$\mathcal{K} = \Big\{ K = \sum_{i=1}^{m} \mu_i K_i \;\Big|\; K \succeq 0,\ \mu \in \mathbb{R}^m \Big\}.$$


Lemma 5  Let $\tilde{\mathcal{K}} = \mathcal{K} \cup \{ y y^T \mid y \in \{\pm 1\}^{2n} \}$. Then $E_\sigma \max_{K \in \tilde{\mathcal{K}}} \sigma^T K \sigma = 2n^2$.

Proof.  Let $K = y y^T$ and $\sigma = y$. The result follows immediately.  ■

Thus the complexity term for SDP becomes $O(1)$. That is, it does not decrease with $n$. A similar conclusion is also observed in [24]. On the other hand, $\tilde{O}(\sqrt{d / (n \theta^2)})$ approaches 0 as $n$ goes to infinity, with everything else fixed. That is, BSSD has a faster convergence rate than SDP, at least in the way in which the complexity of SDP is measured. This may explain why BSSD has superior performance over SDP on the data sets we have experimented with.
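
To make the asymptotic comparison concrete, the sketch below evaluates the two complexity terms for growing $n$. It assumes that the SDP term has the form $\sqrt{C(\mathcal{K}) / (n^2 \gamma^2)}$ with $C(\mathcal{K}) = 2n^2$, and that the boosting term has the margin-bound form $\sqrt{d / (n \theta^2)}$ with logarithmic and constant factors dropped. The function names and the values of `gamma`, `d`, and `theta` are hypothetical and chosen only for illustration.

```python
import math


def sdp_complexity(n, gamma):
    """sqrt(C / (n^2 * gamma^2)) with C = 2 * n^2, as assumed above."""
    c = 2 * n ** 2
    return math.sqrt(c / (n ** 2 * gamma ** 2))


def boosting_complexity(n, d, theta):
    """sqrt(d / (n * theta^2)), ignoring logarithmic and constant factors."""
    return math.sqrt(d / (n * theta ** 2))


# The SDP term stays constant in n, while the boosting term shrinks.
for n in (100, 10_000, 1_000_000):
    print(n, sdp_complexity(n, gamma=1.0), boosting_complexity(n, d=10, theta=0.5))
```

Under these assumptions the first quantity equals $\sqrt{2}/\gamma$ for every $n$, whereas the second decays like $1/\sqrt{n}$.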
7  Summary

We have presented a novel data fusion technique for multimodal learning. We have provided a theoretical analysis of our proposal and shown empirically that our technique significantly outperforms several competing techniques on a number of classification problems and is very robust against noise, which is essential for any data fusion system that must operate across the full range of scenarios it is expected to encounter. While we have shown experimentally that shared sampling plays a key role in robustness against noise, we have yet to provide a precise mathematical statement. In future research, we plan to apply advanced concentration inequalities to provide high-confidence performance bounds for our shared sampling technique.